Fault Tolerant PVM Application? A Case Study with D-CICADA(FT)
Abstract
The ever-increasing number of networked computers is driving the development of new programming techniques that make full use of the potential of these virtual computers. Besides the difficulty of programming in a distributed computing environment, the problem of fault tolerance is also emerging. The distributed computing environment is often used as a low-cost substitute for fast (super)computers. However, the very nature of these virtual computers, built from dozens of ordinary workstations in local or metropolitan area networks, is not well suited to long-running tasks. Network congestion, software and hardware crashes, broken connections, and similar failures all further complicate efficient use of the virtual computer. To ease its use, modified and/or new tools and techniques must be provided, both at the level of the distributed computing environment and at the level of programming methodology. This article presents D-CICADA(FT), a robust implementation of the D-CICADA system [1]. It shows how careful design can lead to a robust (and almost fault-tolerant) system.
Similar resources
FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services: A Case Study
There is a growing interest in deploying MPI over very large numbers of heterogeneous, geographically distributed resources. FT-MPI provides the fault-tolerance necessary at this scale, but presents some issues when crossing multiple administrative domains. Using the H2O metacomputing framework, we add cross-administrative domain interoperability and pluggability to FT-MPI. The latter feature al...
FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World
Initial versions of MPI were designed to work efficiently on multiprocessors that had very little job control and thus static process models; subsequently forcing them to support dynamic process operations would have affected their performance. As current HPC systems increase in size with higher potential levels of individual node failure, the need rises for new fault tolerant systems to be de...
FTOP: A Library for Fault Tolerance in a Cluster
Checkpointing and rollback recovery is a simple technique for fault tolerance. The state of a process is saved on a disk file from which the process can recover on the occurrence of failure. In this paper we describe the implementation of FTOP (Fault Tolerant PVM), a coordinated checkpointing library integrated with PVM. Existing PVM applications require only minor change for incorporating faul...
Fault-Tolerant RT-Mach (FT-RT-Mach) and an Application to Real-Time Train Control
Even though real-time systems have the stringent constraint of completing tasks before their deadlines, many existing real-time operating systems do not implement fault tolerance capabilities. In this paper we summarize a fault-tolerant real-time scheduling policy for dynamic tasks with ready times and deadlines. Our focus in this paper is the implementation, which includes fault-tolerant schedul...
Application Recovery in Parallel Programming Environment
In this paper, the fault-tolerance feature of the TOPAS parallel programming environment for distributed systems is presented. TOPAS automatically analyzes data dependence among tasks and synchronizes data, which reduces the time needed for parallel program development. TOPAS also provides support for scheduling, load balancing and fault tolerance. The main topic of this paper is to present the solut...
Publication date: 1994